# Prompt Caching and Context Management

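> **One-sentence summary**: Three cache scopes (global/org/ephemeral), a static/dynamic boundary, the latch pattern, and Auto-Compact's three-tier degradation strategy together maximize the prompt cache hit rate and make efficient use of the context window.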

> Core files: `src/services/api/claude.ts`, `src/utils/api.ts`, `src/services/compact/`

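## 1. The Complete Prompt Caching Strategy

### 1.1 Three Cache Scopes

| Scope | Meaning | Used for |
|-------|---------|----------|
| `global` | Shared across users and organizations | Static region of the system prompt |
| `org` | Shared within an organization | System prompt + tool schemas |
| `ephemeral` (no scope) | Request-level | Message-level cache markers |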

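### 1.2 Cache TTL Strategy

> [!tip] Latch pattern
> TTL eligibility is locked in when the session starts, so a mid-session change in over-quota status cannot flip the TTL and invalidate the cache. The same "decide once, never change" approach is applied to beta headers.

Two TTLs:
- **5 minutes**: the default (the API's default behavior)
- **1 hour**: for eligible users (ant users, or paid subscribers who are not over quota)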
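
The latch can be pictured as a value that is computed once and then reused. The following is a minimal sketch, not the actual implementation: `TtlLatch`, `UserStatus`, and the reading of eligibility as "ant, or paid and not over quota" are assumptions made for illustration.

```typescript
// Hypothetical sketch: names and shapes are illustrative assumptions.
type CacheTtl = '5m' | '1h'

interface UserStatus {
  isAnt: boolean
  isPaidSubscriber: boolean
  isOverQuota: boolean
}

class TtlLatch {
  private latched: CacheTtl | undefined

  // Eligibility is evaluated on the first call only; later calls return the latched value,
  // even if the user's quota status has changed in the meantime.
  get(status: UserStatus): CacheTtl {
    if (this.latched === undefined) {
      const eligible = status.isAnt || (status.isPaidSubscriber && !status.isOverQuota)
      this.latched = eligible ? '1h' : '5m'
    }
    return this.latched
  }
}
```

Because the TTL is part of what the cache is keyed on, holding it fixed trades a little policy accuracy for a stable cache prefix.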


### 1.3 System Prompt Cache Splitting

`splitSysPromptPrefix()` splits the system prompt into multiple blocks:

**Global cache mode (boundary marker present)**:
```
Block 1: Attribution Header                 → cacheScope=null (not cached)
Block 2: System Prompt Prefix               → cacheScope=null
Block 3: Static content (before boundary)   → cacheScope='global' (shared across users)
Block 4: Dynamic content (after boundary)   → cacheScope=null
```

**Default mode**:
```
Block 1: Attribution Header          → cacheScope=null
Block 2: System Prompt Prefix        → cacheScope='org'
Block 3: Remaining content           → cacheScope='org'
```
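
The two layouts can be sketched as a single function. This is an illustration under stated assumptions, not the real `splitSysPromptPrefix()`: the `PromptBlock` shape, the three-argument signature, and the `BOUNDARY` placeholder value are all hypothetical.

```typescript
// Hypothetical sketch: block shape, signature, and marker value are assumptions.
type CacheScope = 'global' | 'org' | null

interface PromptBlock {
  text: string
  cacheScope: CacheScope
}

const BOUNDARY = '<<DYNAMIC_BOUNDARY>>' // placeholder, not the real marker value

function splitPromptSketch(attribution: string, prefix: string, body: string): PromptBlock[] {
  const idx = body.indexOf(BOUNDARY)
  if (idx === -1) {
    // Default mode: no boundary marker, so prefix and body are cached at org scope.
    return [
      { text: attribution, cacheScope: null },
      { text: prefix, cacheScope: 'org' },
      { text: body, cacheScope: 'org' },
    ]
  }
  // Global cache mode: only the static part before the boundary is shared globally.
  return [
    { text: attribution, cacheScope: null },
    { text: prefix, cacheScope: null },
    { text: body.slice(0, idx), cacheScope: 'global' },
    { text: body.slice(idx + BOUNDARY.length), cacheScope: null },
  ]
}
```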

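### 1.4 Message-Level Cache Breakpoints

Each request carries exactly **one** message-level `cache_control` marker. It normally goes on the last message; fork requests that skip the cache write place it on the second-to-last message instead:

```typescript
// Index of the message that receives the single message-level cache_control marker.
const markerIndex = skipCacheWrite
  ? messages.length - 2 // fork: second-to-last message, so the shared cache prefix is reused
  : messages.length - 1 // normal request: last message
```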

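### 1.5 Cached Microcompact (Cache Editing)

Uses the API's `cache_edits` + `cache_reference` to delete old tool results **without invalidating the existing cache**:

```
1. Add cache_reference: tool_use_id to each tool_result block
2. When cleanup is needed, send a cache_edits block that deletes the given cache_reference
3. The cache_edits block is "pinned" to its original position and resent with every later request
```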

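### 1.6 Cache Break Detection

A two-phase diagnostic system:

**Phase 1 (pre-call)**: record hashes of the request state
- System prompt hash, tool schemas hash, cache_control hash, model, betas, and so on

**Phase 2 (post-call)**:
- Compare `cache_read_input_tokens` against the previous call
- A drop of more than 5% that also exceeds 2,000 tokens in absolute terms is classified as a cache break
- The pending changes recorded in Phase 1 are used to explain the cause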
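
The two phases can be sketched roughly as follows. The thresholds come from the text; the function names, the snapshot shape, and the reading of "absolute value" as the absolute size of the drop are assumptions.

```typescript
// Hypothetical sketch: thresholds are from the text, everything else is illustrative.
const MIN_DROP_RATIO = 0.05
const MIN_DROP_TOKENS = 2000

// Phase 2: both conditions must hold (relative drop above 5% and absolute drop above 2,000 tokens).
function isCacheBreak(prevCacheRead: number, currCacheRead: number): boolean {
  const drop = prevCacheRead - currCacheRead
  if (drop <= 0) return false
  return drop > prevCacheRead * MIN_DROP_RATIO && drop > MIN_DROP_TOKENS
}

// Attribution: list which Phase 1 hashes differ between two pre-call snapshots.
function explainCacheBreak(
  before: Record<string, string>,
  after: Record<string, string>,
): string[] {
  return Object.keys(after).filter(key => before[key] !== after[key])
}
```

Requiring both a relative and an absolute drop filters out noise on small prompts (a 10% drop of 1,000 tokens) as well as small fluctuations on large ones.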

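### 1.7 Beta Header Latching

Once a dynamic beta header has been sent for the first time, it keeps being sent for the rest of the session, so the cache key does not change mid-session.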
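
One way to picture this is a set that only ever grows. The sketch below is hypothetical (`BetaHeaderLatch` and `resolve` are illustrative names, not the source's API):

```typescript
// Hypothetical sketch: a sticky, grow-only set of beta headers.
class BetaHeaderLatch {
  private readonly sent = new Set<string>()

  // Returns the headers for this request: everything requested now plus everything sent earlier.
  resolve(requested: string[]): string[] {
    for (const header of requested) this.sent.add(header)
    // Sorted so that the header list itself does not vary between requests.
    return Array.from(this.sent).sort()
  }
}
```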

## 2. Context Window Management

### 2.1 Context Window Size

| Model | Default window |
|-------|----------------|
| Standard models | 200,000 tokens |
| Models with 1M support | 1,000,000 tokens |
| Environment variable override | `CLAUDE_CODE_MAX_CONTEXT_TOKENS` (ant) |

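### 2.2 Auto-Compact Thresholds

Using a 200K model as the example:

```
Context Window                 200,000
- Reserved output space        -20,000
= Effective window             180,000
- Auto-compact buffer          -13,000
= Auto-compact threshold       167,000 tokens ← compaction triggers above this
- Warning threshold buffer     -20,000
= Warning threshold            147,000 tokens ← warning shown above this
```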
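
The same arithmetic as a small function. The constant names are hypothetical, and applying the same buffers to other window sizes is an assumption; only the 200K figures are given in the text.

```typescript
// Buffers mirror the 200K breakdown; names are hypothetical.
const RESERVED_OUTPUT_TOKENS = 20_000
const AUTOCOMPACT_BUFFER_TOKENS = 13_000
const WARNING_BUFFER_TOKENS = 20_000

function computeThresholds(contextWindow: number) {
  const effectiveWindow = contextWindow - RESERVED_OUTPUT_TOKENS
  const autoCompactThreshold = effectiveWindow - AUTOCOMPACT_BUFFER_TOKENS
  const warningThreshold = autoCompactThreshold - WARNING_BUFFER_TOKENS
  return { effectiveWindow, autoCompactThreshold, warningThreshold }
}

// For a 200K window: effectiveWindow 180000, autoCompactThreshold 167000, warningThreshold 147000
const thresholds200k = computeThresholds(200_000)
```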

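### 2.3 Token Counting

**Core function** `tokenCountWithEstimation()`:
1. Scan backwards for the last assistant message that has API usage data
2. Use `input_tokens + cache_creation + cache_read + output_tokens` as the baseline
3. Add a rough estimate for any messages after it (4 characters ≈ 1 token)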
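
The three steps can be sketched as below. This is not the real `tokenCountWithEstimation()`: the `ChatMessage` shape is a simplified assumption, and rounding the character estimate up is an illustrative choice.

```typescript
// Hypothetical sketch: message shape is simplified for illustration.
interface Usage {
  input_tokens: number
  cache_creation_input_tokens: number
  cache_read_input_tokens: number
  output_tokens: number
}

interface ChatMessage {
  role: 'user' | 'assistant'
  text: string
  usage?: Usage
}

function estimateTokens(messages: ChatMessage[]): number {
  let total = 0
  let start = 0
  // Step 1: walk backwards to the last assistant message that carries API usage.
  for (let i = messages.length - 1; i >= 0; i--) {
    const usage = messages[i]?.role === 'assistant' ? messages[i]?.usage : undefined
    if (usage) {
      // Step 2: the reported usage is the exact baseline up to and including this message.
      total = usage.input_tokens + usage.cache_creation_input_tokens +
        usage.cache_read_input_tokens + usage.output_tokens
      start = i + 1
      break
    }
  }
  // Step 3: rough estimate for everything after the baseline (4 characters per token).
  for (const message of messages.slice(start)) {
    total += Math.ceil(message.text.length / 4)
  }
  return total
}
```

Anchoring on the last reported usage keeps the estimate exact for most of the conversation; only the newest, not-yet-sent messages rely on the character heuristic.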

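### 2.4 Auto-Compact Circuit Breaker

> [!warning] The circuit breaker prevents endless retries
> Retries stop after more than 3 consecutive failures. BQ data showed 1,279 sessions with 50+ failures each (3,272 at most), wasting roughly 250K API calls per day.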

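
A minimal sketch of such a circuit breaker follows. The class and method names are hypothetical, and the exact comparison (`>` versus `>=` against the limit of 3) is an assumption based on the "more than 3" wording.

```typescript
// Hypothetical sketch: the limit is from the text, the comparison is an assumption.
const MAX_CONSECUTIVE_FAILURES = 3

class CompactCircuitBreaker {
  private consecutiveFailures = 0

  // Auto-compact may run until the failure streak exceeds the limit.
  canAttempt(): boolean {
    return this.consecutiveFailures <= MAX_CONSECUTIVE_FAILURES
  }

  // Any success resets the streak, so only consecutive failures trip the breaker.
  recordSuccess(): void {
    this.consecutiveFailures = 0
  }

  recordFailure(): void {
    this.consecutiveFailures += 1
  }
}
```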

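## 3. The Full /compact Implementation

### 3.1 Execution Flow

```
/compact [optional instructions]
    ├── 1. Collect the messages after the compact boundary
    ├── 2. Try Session Memory Compaction (no LLM, instant)
    ├── 3. Otherwise fall back to traditional compaction:
    │      ├── microcompact trims first
    │      └── compactConversation() produces an LLM summary
    └── 4. Post-compact cleanup + context restoration
```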

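### 3.2 Three-Tier Compact Strategy

| Tier | Method | Needs LLM | Speed |
|------|--------|-----------|-------|
| **Session Memory Compact** | Uses the existing session memory as the summary | ❌ | Instant |
| **Microcompact** | Clears the content of old tool results | ❌ | Instant |
| **Traditional Compact** | LLM generates a conversation summary | ✅ | Several seconds |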

### 3.3 Compact Prompt Design

**Full preamble (NO_TOOLS_PREAMBLE)**:
```
CRITICAL: Respond with TEXT ONLY. Do NOT call any tools.

- Do NOT use Read, Bash, Grep, Glob, Edit, Write, or ANY other tool.
- You already have all the context you need in the conversation above.
- Tool calls will be REJECTED and will waste your only turn — you will fail the task.
- Your entire response must be plain text: an <analysis> block followed by a <summary> block.
```

**Core instruction (opening of BASE_COMPACT_PROMPT)**:
```
Your task is to create a detailed summary of the conversation so far, 
paying close attention to the user's explicit requests and your previous actions.
This summary should be thorough in capturing technical details, code patterns, 
and architectural decisions that would be essential for continuing development 
work without losing context.
```

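**9 sections** (each with its own sub-instructions):
1. **Primary Request and Intent**: capture every explicit user request and intent in detail
2. **Key Technical Concepts**: list all important technical concepts and frameworks
3. **Files and Code Sections**: enumerate files with **full code snippets** and explain why each was read or modified
4. **Errors and fixes**: every error and its fix, **with particular attention to user feedback**
5. **Problem Solving**: troubleshooting that is resolved or still in progress
6. **All user messages**: list every user message that is not a tool result (**critical for understanding feedback and shifts in intent**)
7. **Pending Tasks**: tasks that were explicitly requested but are not yet done
8. **Current Work**: a detailed account of the work immediately before the compact, with file names and code snippets
9. **Optional Next Step**: the next step must **quote the most recent conversation verbatim**, to prevent task drift

**Custom instruction injection**: users can inject extra instructions into the prompt through `compact_instructions`, for example: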
```
## Compact Instructions
When summarizing focus on typescript code changes and mistakes you made.
```

**Full trailer (NO_TOOLS_TRAILER)**:
```
REMINDER: Do NOT call any tools. Respond with plain text only — 
an <analysis> block followed by a <summary> block. 
Tool calls will be rejected and you will fail the task.
```

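**Partial Compact variants**:
- `PARTIAL_COMPACT_PROMPT` (direction='from'): summarizes only the part **after the selected message**
- `PARTIAL_COMPACT_UP_TO_PROMPT` (direction='up_to'): summarizes only the part **before the selected message**; sections 8 and 9 are replaced by "Work Completed" and "Context for Continuing Work"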

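### 3.4 Scratchpad CoT Pattern

```xml
<analysis>
[The model's reasoning, making sure every point is covered]
</analysis>

<summary>
[The final summary]
</summary>
```

**Key point**: `formatCompactSummary()` automatically strips the `<analysis>` section and keeps only `<summary>`. The model gets room to think, which improves quality, while its intermediate reasoning stays out of the context, which saves tokens.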
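
The stripping step might look roughly like the sketch below. `stripAnalysis` is a hypothetical stand-in; the real `formatCompactSummary()` may handle more cases, and the fallback for a missing `<summary>` tag is an assumption.

```typescript
// Hypothetical sketch of the post-processing step; the real function may differ.
function stripAnalysis(raw: string): string {
  // Drop the scratchpad entirely.
  const withoutAnalysis = raw.replace(/<analysis>[\s\S]*?<\/analysis>/g, '')
  // Keep the summary body if the tags are present; otherwise fall back to the remaining text.
  const match = withoutAnalysis.match(/<summary>([\s\S]*?)<\/summary>/)
  return (match?.[1] ?? withoutAnalysis).trim()
}
```

The non-greedy `[\s\S]*?` matches across newlines, which a plain `.` would not do without the `s` flag.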

### 3.5 Three Layers of Tool-Call Protection

1. **Opening preamble**: "CRITICAL: Respond with TEXT ONLY. Do NOT call any tools."
2. **Closing trailer**: "REMINDER: Do NOT call any tools."
3. **Permission-layer denial**: `canUseTool → deny`

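### 3.6 Post-Compact Context Restoration

| Restored content | Budget |
|------------------|--------|
| Recently accessed files | At most 5 files, 50K tokens total, 5K per file |
| Invoked skills | 25K tokens total, 5K per skill |
| Plan file | Restored as needed |
| Async agent state | Restored as needed |
| MCP instructions | Restored as needed |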
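
To make the file budget concrete, here is a minimal sketch of a selection loop that honors all three limits. The limits come from the table; `RecentFile`, `selectFilesToRestore`, and the most-recent-first ordering are assumptions, and real truncation would operate on content rather than a token count.

```typescript
// Hypothetical sketch: limits are from the table, everything else is illustrative.
interface RecentFile {
  path: string
  tokens: number
}

const MAX_FILES = 5
const TOTAL_BUDGET = 50_000
const PER_FILE_BUDGET = 5_000

function selectFilesToRestore(recent: RecentFile[]): RecentFile[] {
  const selected: RecentFile[] = []
  let used = 0
  for (const file of recent) { // assumed to be ordered most recent first
    if (selected.length >= MAX_FILES) break
    const tokens = Math.min(file.tokens, PER_FILE_BUDGET) // cap each file
    if (used + tokens > TOTAL_BUDGET) break
    selected.push({ path: file.path, tokens })
    used += tokens
  }
  return selected
}
```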

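### 3.7 PTL Retry Strategy

When the compact request itself triggers prompt-too-long (PTL):
- Progressively drop the oldest API-round groups
- Drop enough to cover the exact token gap when it can be computed; otherwise drop 20% by default
- Retry at most 3 times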
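
The retry loop can be sketched as follows. All names (`RoundGroup`, `dropOldestGroups`, `compactWithRetry`, `SummarizeResult`) are hypothetical, and the summarizer is passed in as a callback so the sketch stays self-contained.

```typescript
// Hypothetical sketch: shapes, names, and error signaling are assumptions.
interface RoundGroup {
  id: number
  tokens: number
}

const MAX_PTL_RETRIES = 3
const DEFAULT_DROP_RATIO = 0.2

// Drop the oldest groups until `tokenGap` tokens are freed, or 20% of the groups if the gap is unknown.
function dropOldestGroups(groups: RoundGroup[], tokenGap?: number): RoundGroup[] {
  if (tokenGap === undefined) {
    const count = Math.max(1, Math.ceil(groups.length * DEFAULT_DROP_RATIO))
    return groups.slice(count)
  }
  let freed = 0
  let dropCount = 0
  while (dropCount < groups.length && freed < tokenGap) {
    freed += groups[dropCount]?.tokens ?? 0
    dropCount++
  }
  return groups.slice(dropCount)
}

type SummarizeResult = { ok: true; summary: string } | { ok: false; tokenGap?: number }

function compactWithRetry(
  groups: RoundGroup[],
  summarize: (groups: RoundGroup[]) => SummarizeResult,
): string | undefined {
  let current = groups
  // One initial attempt plus up to MAX_PTL_RETRIES retries.
  for (let attempt = 0; attempt <= MAX_PTL_RETRIES; attempt++) {
    const result = summarize(current)
    if (result.ok) return result.summary
    current = dropOldestGroups(current, result.tokenGap)
  }
  return undefined // retry budget spent
}
```

Dropping whole API-round groups, rather than individual messages, keeps each tool call paired with its result in what remains.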

## 4. Microcompact: Lightweight Tool Result Cleanup

**Clearable tools**: Read, Bash, PowerShell, Grep, Glob, WebSearch, WebFetch, Edit, Write

**Two paths**:

1. **Time-Based**: once the cache TTL has expired, old tool results are replaced with `'[Old tool result content cleared]'`
2. **Cached MicroCompact** (ant): deletes old tool results through the API's `cache_edits` without invalidating the cache
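
The time-based path can be sketched as below. The tool list and placeholder string come from the text; the block shape, the function name, and the `keepRecent` parameter (how many of the newest results to leave untouched) are assumptions.

```typescript
// Hypothetical sketch: block shape and keepRecent are illustrative assumptions.
interface ToolResultBlock {
  toolName: string
  content: string
}

const CLEARABLE_TOOLS = new Set([
  'Read', 'Bash', 'PowerShell', 'Grep', 'Glob', 'WebSearch', 'WebFetch', 'Edit', 'Write',
])
const CLEARED_PLACEHOLDER = '[Old tool result content cleared]'

function microcompactIfCacheExpired(
  blocks: ToolResultBlock[],
  idleMs: number,
  ttlMs: number,
  keepRecent: number,
): ToolResultBlock[] {
  // While the cache is still warm, rewriting old results would invalidate the cached prefix.
  if (idleMs < ttlMs) return blocks
  const cutoff = Math.max(0, blocks.length - keepRecent)
  return blocks.map((block, index) =>
    index < cutoff && CLEARABLE_TOOLS.has(block.toolName)
      ? { ...block, content: CLEARED_PLACEHOLDER }
      : block,
  )
}
```

Waiting for the TTL to expire means the rewrite costs nothing extra: the prefix would have to be re-created on the next request anyway.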

## 5. API Context Management (Server-Side)

`getAPIContextManagement()` configures server-side context strategies through `context_management.edits`:

### 5.1 Thinking Cleanup Strategy (`clear_thinking_20251015`)

```
hasThinking && !isRedactThinkingActive → keep: 'all' (keep all thinking)
clearAllThinking (idle >1h, cache already missed) → keep: { type: 'thinking_turns', value: 1 }
```

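### 5.2 Tool Result Cleanup Strategy (`clear_tool_uses_20250919`, ant-only)

Two modes, each enabled by an environment variable:

| Mode | Environment variable | Behavior |
|------|----------------------|----------|
| **clear_tool_inputs** | `USE_API_CLEAR_TOOL_RESULTS` | Clears the tool result content of the listed tools |
| **exclude_tools** | `USE_API_CLEAR_TOOL_USES` | Clears entire tool_use entries except for the listed tools |

**Tools whose results can be cleared**: Bash, PowerShell, Glob, Grep, Read, WebFetch, WebSearch

**Tools excluded from clearing**: Edit, Write, NotebookEdit (preserves editing context)

Trigger threshold: `input_tokens > 180,000`. Cleanup target: keep the last 40,000 tokens.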

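### 5.3 Context Collapse (feature flag `CONTEXT_COLLAPSE`)

Context Collapse is an alternative to autocompact that manages the context window through **fine-grained collapsing** rather than a single summary. When enabled:
- **Proactive autocompact is suppressed** (autocompact no longer triggers on its own)
- Collapse's own flow (commit at 90%, block at 95%) takes over headroom management
- Reactive compact remains as the fallback for 413 PTL errors
- Manual `/compact` and session memory compact remain available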

**Read-time Projection architecture** (`query.ts`):

```
// Nothing is yielded — the collapsed view is a read-time projection
// over the REPL's full history. Summary messages live in the collapse
// store, not the REPL array. This is what makes collapses persist
// across turns: projectView() replays the commit log on every entry.
```

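Key design point: the collapsed view never modifies the original message array; it is projected dynamically on every read, much like an ordinary (non-materialized) database view that is recomputed at query time. This keeps collapses reversible and preserves the full original conversation history.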
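
A read-time projection of this kind can be sketched in a few lines. The `CollapseCommit` shape and this `projectView` signature are assumptions made for illustration; the source only states that `projectView()` replays the commit log.

```typescript
// Hypothetical sketch: commit shape and signature are illustrative assumptions.
interface CollapseCommit {
  start: number   // index of the first collapsed message (inclusive)
  end: number     // index of the last collapsed message (inclusive)
  summary: string // lives in the collapse log, never in the message array
}

function projectView(fullLog: readonly string[], commits: readonly CollapseCommit[]): string[] {
  const view: string[] = []
  let cursor = 0
  // Replay commits in positional order; assumes the ranges do not overlap.
  for (const commit of [...commits].sort((a, b) => a.start - b.start)) {
    view.push(...fullLog.slice(cursor, commit.start))
    view.push(commit.summary)
    cursor = commit.end + 1
  }
  view.push(...fullLog.slice(cursor))
  return view
}
```

Because the function only reads `fullLog`, removing a commit from the log restores the original messages on the next projection.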

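**Overflow Recovery flow** (`query.ts:1090-1120`): when the API returns a 413 PTL, Context Collapse first runs `recoverFromOverflow()` to drain the staged collapses; if the request still gets a 413 after draining, control falls through to reactive compact. Both recovery layers share one withholding mechanism: a withheld PTL error is not surfaced to the user directly but is absorbed by the recovery layer, which then retries.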

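## 6. Key Design Highlights

1. **Static/dynamic boundary marker**: `SYSTEM_PROMPT_DYNAMIC_BOUNDARY` separates globally cacheable content from dynamic content
2. **Latch pattern**: TTL decisions and beta headers never change once set, avoiding cache busts
3. **Cache-Sharing Fork**: compact uses `runForkedAgent` to reuse the main conversation's cache prefix
4. **Circuit breaker**: retries stop after 3 consecutive failures
5. **Post-Compact restoration**: recent files and skills are restored automatically to minimize information loss
6. **Path normalization**: `$TMPDIR` replaces the actual path so per-user differences do not break the global cache
7. **Image stripping**: before compacting, image/document blocks are replaced with `[image]`/`[document]` text markers so the compaction request itself does not hit PTL
8. **Transcript archiving**: before compacting, the messages about to be compressed are written to the session transcript file (KAIROS feature)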

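## 7. Practical Takeaways

1. **Cache-aware development**: CC relies heavily on the prompt cache; any operation that changes the system prompt or tool schemas busts the cache (a cache_creation cost of roughly 50-70K tokens). Beta header latching exists for exactly this reason
2. **Compact prompt engineering**: the 9-section template plus the `<analysis>` scratchpad is a model example of letting the model think thoroughly without wasting context. The analysis is stripped afterwards and only the summary is kept
3. **Three-tier compaction design**: the progressive Session Memory → Microcompact → LLM Compact strategy avoids expensive LLM calls wherever possible
4. **The elegance of Cached Microcompact**: `cache_reference` + `cache_edits` delete old tool results at the KV-cache level without rebuilding the cache prefix, a typical way to free space without invalidating the cache
5. **PTL retry strategy**: the compact request itself can hit PTL; the remedy is to progressively drop the oldest API-round groups and retry (at most 3 times)
6. **Lessons from Cache Break Detection**: CC continuously monitors drops in cache_read and correlates them with the state changes recorded in Phase 1 to attribute the cause automatically. This pre/post snapshot comparison pattern applies to any system that needs to diagnose performance regressions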